Abstract




A mixture of Gaussians fit to a single curved 
or heavy-tailed cluster will report that the 
data contains many clusters. To produce 
more appropriate clusterings, we introduce 
a model which warps a latent mixture of 
Gaussians to produce nonparametric cluster 
shapes. The possibly low-dimensional latent 
mixture model allows us to summarize the 
properties of the high-dimensional clusters 
(or density manifolds) describing the data. 
The number of manifolds, as well as the shape
and dimension of each manifold, is automatically
inferred. We derive a simple inference
scheme for this model which analytically inte- 
grates out both the mixture parameters and 
the warping function. We show that our 
model is effective for density estimation, per- 
forms better than infinite Gaussian mixture 
models at recovering the true number of clusters,
and produces interpretable summaries
of high-dimensional datasets. 

1 Introduction 

Probabilistic mixture models are often used for cluster- 
ing. However, if the mixture components are paramet- 
ric (e.g. Gaussian), then the clustering obtained can 
be heavily dependent on how well each actual clus- 
ter can be modeled by a Gaussian. For example, a 
heavy tailed or curved cluster may need many compo- 
nents to model it. Thus, although mixture models are 
widely used for probabilistic clustering, their assump- 
tions are generally inappropriate if the primary goal 
is to discover clusters in data. Dirichlet process mix- 
ture models can alleviate the problem of an unknown 
number of clusters, but this does not address the prob- 
lem that real clusters may not be well matched by any 
parametric density. 



Latent space Observed space 

Figure 1: A sample from the iWMM prior. Left: 
In the latent space, a mixture distribution is sampled 
from a Dirichlet process mixture of Gaussians. Right: 
The latent mixture is smoothly warped to produce 
non-Gaussian manifolds in the observed space. 

In this paper, we propose a nonparametric Bayesian 
model that can find nonlinearly separable clusters with 
complex shapes. The proposed model assumes that 
each observation has coordinates in a latent space, and 
is generated by warping the latent coordinates via a 
nonlinear function from the latent space to the ob- 
served space. By this warping, complex shapes in the 
observed space can be modeled by simpler shapes in 
the latent space. In the latent space, we assume an 
infinite Gaussian mixture model [1], which allows us 
to automatically infer the number of clusters. For the 
prior on the nonlinear mapping function, we use
Gaussian processes [2], which enable us to flexibly infer the
nonlinear warping function from the data. We call
the proposed model the infinite warped mixture model
(iWMM). Figure 1 shows a set of manifolds and
datapoints sampled from the prior defined by this model.

To our knowledge this is the first probabilistic gener- 
ative model for clustering with flexible nonparametric 
component densities. Since the proposed model is gen- 
erative, it can be used for density estimation as well as 
clustering. It can also be extended to handle missing 
data, integrate with other probabilistic models, and 



use other families of distributions for the latent com- 
ponents. 

We derive an inference procedure for the iWMM based 
on Markov chain Monte Carlo (MCMC). In partic- 
ular, we sample the cluster assignments using Gibbs 
sampling, sample the latent coordinates using hybrid 
Monte Carlo, and analytically integrate out both the 
mixture parameters (weights, means and covariance 
matrices), and the nonlinear warping function. 

2 Gaussian Process Latent Variable 
Model 

In this section, we give a brief introduction to the 
GPLVM, which can be viewed as a special case of the 
iWMM. The GPLVM is a probabilistic model of non- 
linear manifolds. While not typically thought of as a 
density model, the GPLVM does in fact define a pos- 
terior density over observations [3]. It does this by 
smoothly warping a single, isotropic Gaussian density 
in the latent space into a more complicated distribu- 
tion in the observed space. 

Suppose that we have a set of observations
$Y = (y_1, \dots, y_N)^\top$, where $y_n \in \mathbb{R}^D$, and they are
associated with a set of latent coordinates
$X = (x_1, \dots, x_N)^\top$, where $x_n \in \mathbb{R}^Q$. The GPLVM assumes
that observations are generated by mapping the latent 
coordinates through a set of smooth functions, over 
which Gaussian process priors are placed. Under the 
GPLVM, the probability of observations given the la- 
tent coordinates, integrating out the mapping func- 
tions, is 

$$p(Y \mid X, \theta) = (2\pi)^{-\frac{DN}{2}} |K|^{-\frac{D}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}(Y^\top K^{-1} Y)\right), \tag{1}$$

where $K$ is the $N \times N$ covariance matrix defined by
the kernel function $k(x_n, x_m)$, and $\theta$ is the kernel
hyperparameter vector. In this paper, we use an RBF
kernel with an additive noise term: 



$$k(x_n, x_m) = \alpha \exp\left(-\frac{1}{2\ell^2}(x_n - x_m)^\top (x_n - x_m)\right) + \delta_{nm}\beta^{-1}. \tag{2}$$



This likelihood is simply the product of D indepen- 
dent Gaussian process likelihoods, one for each output 
dimension. 
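As a minimal sketch of how (1) and (2) fit together, the following NumPy code evaluates the log of the GPLVM likelihood. The function names `rbf_kernel` and `gplvm_log_likelihood` are hypothetical and not taken from the paper's code; a Cholesky factorization is assumed here for numerical stability.

```python
import numpy as np

def rbf_kernel(X, alpha, beta, ell):
    """Eq. (2): K[n, m] = alpha * exp(-|x_n - x_m|^2 / (2 ell^2)) + delta_nm / beta."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return alpha * np.exp(-0.5 * sq_dists / ell**2) + np.eye(len(X)) / beta

def gplvm_log_likelihood(Y, X, alpha, beta, ell):
    """Eq. (1) in log form: log p(Y | X, theta), warping functions integrated out."""
    N, D = Y.shape
    K = rbf_kernel(X, alpha, beta, ell)
    L = np.linalg.cholesky(K)
    A = np.linalg.solve(L, Y)  # L^{-1} Y, so tr(Y^T K^{-1} Y) equals the sum of A**2
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
    return (-0.5 * D * N * np.log(2.0 * np.pi)
            - 0.5 * D * log_det_K
            - 0.5 * np.sum(A**2))
```

Because the same K is shared by all D output dimensions, the result equals the sum of D single-output Gaussian process log likelihoods, one per column of Y.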

Typically, the GPLVM is used for dimensionality re- 
duction or visualization, and the latent coordinates are 
determined by maximizing the posterior probability of 
the latent coordinates, while integrating out the warp- 
ing function. In that setting, the Gaussian prior den- 
sity on x is essentially a regularizer which keeps the 



latent coordinates from spreading arbitrarily far apart. 
In contrast, we instead integrate out the latent coor- 
dinates as well as the warping function, and place a 
more flexible parameterization on p(x) than a single 
isotropic Gaussian. 

Just as the GPLVM can be viewed as a manifold learn- 
ing algorithm, the iWMM can be viewed as learning a 
set of manifolds, one for each cluster. 

3 Infinite Warped Mixture Model 

In this section, we define in detail the infinite warped 
mixture model (iWMM). In the same way as the 
GPLVM, the iWMM assumes a set of latent coordi- 
nates and a smooth, nonlinear mapping from the latent 
space to the observed space. In addition, the iWMM 
assumes that the latent coordinates are generated from 
a Dirichlet process mixture model. In particular, we 
use the following infinite Gaussian mixture model, 

$$p(x \mid \{\lambda_c, \mu_c, R_c\}) = \sum_{c=1}^{\infty} \lambda_c \, \mathcal{N}(x \mid \mu_c, R_c^{-1}), \tag{3}$$

where $\lambda_c$, $\mu_c$ and $R_c$ are the mixture weight, mean, and
precision matrix of the $c$th mixture component. We
place Gaussian-Wishart priors on the Gaussian
parameters $\{\mu_c, R_c\}$,

$$p(\mu_c, R_c) = \mathcal{N}(\mu_c \mid u, (r R_c)^{-1}) \, \mathcal{W}(R_c \mid S^{-1}, \nu), \tag{4}$$

where $u$ is the mean of $\mu_c$, $r$ is the relative precision
of $\mu_c$, $S^{-1}$ is the scale matrix for $R_c$, and $\nu$ is the
number of degrees of freedom for $R_c$. The Wishart
distribution is defined as follows:

$$\mathcal{W}(R \mid S^{-1}, \nu) = \frac{1}{G} |R|^{\frac{\nu - Q - 1}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}(SR)\right), \tag{5}$$

where $G$ is the normalizing constant. Because we use
conjugate Gaussian-Wishart priors for the parameters
of the Gaussian mixture components, we can analytically
integrate out those parameters, given the assignments
of points to components. Let $z_n$ be the latent
assignment of the $n$th point. The probability of latent
coordinates $X$ given latent assignments $Z = (z_1, \dots, z_N)$
is obtained by integrating out the Gaussian
parameters $\{\mu_c, R_c\}$ as follows:

$$p(X \mid Z, S, \nu, u, r) = \prod_{c=1}^{\infty} \pi^{-\frac{N_c Q}{2}} \frac{r^{Q/2} |S|^{\nu/2}}{r_c^{Q/2} |S_c|^{\nu_c/2}} \prod_{q=1}^{Q} \frac{\Gamma\left(\frac{\nu_c + 1 - q}{2}\right)}{\Gamma\left(\frac{\nu + 1 - q}{2}\right)}, \tag{6}$$



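where $N_c$ is the number of data points assigned to the
$c$th component, $\Gamma(\cdot)$ is the Gamma function, and

$$r_c = r + N_c, \quad \nu_c = \nu + N_c, \quad u_c = \frac{r u + \sum_{n: z_n = c} x_n}{r + N_c}, \quad S_c = S + \sum_{n: z_n = c} x_n x_n^\top + r u u^\top - r_c u_c u_c^\top, \tag{7}$$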

are the posterior Gaussian-Wishart parameters of the
$c$th component. We use a Dirichlet process with
concentration parameter $\eta$ for infinite mixture
modeling [4] in the latent space. Then, the probability of
$Z$ is given as follows:

$$p(Z \mid \eta) = \frac{\eta^C \prod_{c=1}^{C} (N_c - 1)!}{\eta(\eta + 1) \cdots (\eta + N - 1)}, \tag{8}$$


where $C$ is the number of components for which $N_c > 0$.
The joint distribution is given by

$$p(Y, X, Z \mid \theta, S, \nu, u, r, \eta) = p(Y \mid X, \theta)\, p(X \mid Z, S, \nu, u, r)\, p(Z \mid \eta), \tag{9}$$

where factors in the right hand side can be calculated 
by (1), (6) and (8), respectively. 
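The Dirichlet process factor (8) is simple to evaluate in log space. The sketch below is illustrative only; the function name `crp_log_prob` is an assumption, and assignments are passed as a plain list of hashable labels.

```python
import math
from collections import Counter

def crp_log_prob(z, eta):
    """Eq. (8) in log form:
    C log(eta) + sum_c log((N_c - 1)!) - sum_{i=0}^{N-1} log(eta + i)."""
    counts = Counter(z)                      # N_c for every occupied component
    log_p = len(counts) * math.log(eta)      # eta^C
    log_p += sum(math.lgamma(n_c) for n_c in counts.values())  # lgamma(n) = log((n-1)!)
    log_p -= sum(math.log(eta + i) for i in range(len(z)))     # eta (eta+1) ... (eta+N-1)
    return log_p
```

A quick sanity check is that the probabilities of all distinct partitions of a small dataset sum to one. For three points there are five partitions, represented for instance by the labelings `[0,0,0]`, `[0,0,1]`, `[0,1,0]`, `[0,1,1]` and `[0,1,2]`.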

In summary, the infinite warped mixture model gen- 
erates observations Y according to the following gen- 
erative process: 

1. Draw mixture weights $\lambda \sim \mathrm{GEM}(\eta)$

2. For each component $c = 1, \dots, \infty$

(a) Draw precision $R_c \sim \mathcal{W}(S^{-1}, \nu)$

(b) Draw mean $\mu_c \sim \mathcal{N}(u, (r R_c)^{-1})$

3. For each observed dimension $d = 1, \dots, D$

(a) Draw function $f_d(x) \sim \mathcal{GP}(m(x), k(x, x'))$

4. For each observation $n = 1, \dots, N$

(a) Draw latent assignment $z_n \sim \mathrm{Mult}(\lambda)$

(b) Draw latent coordinates $x_n \sim \mathcal{N}(\mu_{z_n}, R_{z_n}^{-1})$

(c) For each observed dimension $d = 1, \dots, D$

i. Draw feature $y_{nd} \sim \mathcal{N}(f_d(x_n), \beta^{-1})$

Here, $\mathrm{GEM}(\eta)$ is the stick-breaking process [5] that
generates mixture weights for a Dirichlet process with
parameter $\eta$, $\mathrm{Mult}(\lambda)$ represents a multinomial
distribution with parameter $\lambda$, $m(x)$ is the mean function
of the Gaussian process, and $x, x' \in \mathbb{R}^Q$. Figure 2
shows the graphical model representation of the pro- 
posed model. Here, we assume a Gaussian for the 
mixture component, although we could in principle use 
other distributions such as Student's t-distribution or 
the Laplace distribution. 
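The generative process above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function name `sample_iwmm_prior` is hypothetical, the GP mean is taken to be m(x) = 0, the degrees of freedom `nu` are assumed to be an integer no smaller than Q, and assignments are drawn sequentially from the Chinese restaurant process, which induces the same distribution over partitions as drawing GEM weights followed by multinomial assignments.

```python
import numpy as np

def sample_iwmm_prior(N, D, Q, eta, S, nu, u, r, alpha, beta, ell, rng):
    """Draw (Y, X, z) from the iWMM prior, assuming a zero-mean GP."""
    # Steps 1 and 4(a): sequential Chinese restaurant process draws.
    z, counts = [], []
    for _ in range(N):
        probs = np.array(counts + [eta], dtype=float)
        c = rng.choice(len(probs), p=probs / probs.sum())
        if c == len(counts):
            counts.append(0)          # open a new component
        counts[c] += 1
        z.append(c)
    z = np.array(z)
    # Step 2: Gaussian-Wishart draws, only for the components actually used.
    # For integer nu, a Wishart(S^{-1}, nu) draw is a sum of nu outer products
    # of N(0, S^{-1}) vectors.
    L_scale = np.linalg.cholesky(np.linalg.inv(S))
    X = np.empty((N, Q))
    for c in range(len(counts)):
        G = L_scale @ rng.standard_normal((Q, nu))
        R = G @ G.T
        mu = rng.multivariate_normal(u, np.linalg.inv(r * R))
        idx = np.flatnonzero(z == c)
        # Step 4(b): latent coordinates for the points in this component.
        X[idx] = rng.multivariate_normal(mu, np.linalg.inv(R), size=len(idx))
    # Steps 3 and 4(c): with f_d integrated out, each column of Y is N(0, K),
    # where K from eq. (2) already contains the 1/beta observation noise.
    sq = np.sum(X**2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = alpha * np.exp(-0.5 * sq_dists / ell**2) + np.eye(N) / beta
    Y = np.linalg.cholesky(K) @ rng.standard_normal((N, D))
    return Y, X, z
```

For example, `sample_iwmm_prior(20, 3, 2, 1.0, np.eye(2), 4, np.zeros(2), 0.1, 1.0, 100.0, 1.0, np.random.default_rng(1))` returns a 20 x 3 observation matrix, a 20 x 2 latent matrix and a vector of 20 assignments.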

The iWMM can be seen as a generalization of either 
the GPLVM or the infinite Gaussian mixture model 




Figure 2: A graphical model representation of the infi- 
nite warped mixture model, where the shaded and un- 
shaded nodes indicate observed and latent variables, 
respectively, and plates indicate repetition. 

(iGMM). To be precise, the iWMM with a single fixed 
spherical Gaussian density on the latent coordinates 
corresponds to the GPLVM, while the iWMM with 
fixed direct mapping function $f_d(x) = x_d$ and $Q = D$
corresponds to the iGMM. 

The iWMM offers attractive properties that do not exist
in other probabilistic models; principally, the ability
to model clusters with nonparametric densities, and
to infer a separate dimension for each manifold.

4 Inference 

We infer the posterior distribution of the latent co- 
ordinates X and cluster assignments Z using Markov 
chain Monte Carlo (MCMC). In particular, we alter- 
nate collapsed Gibbs sampling of Z, and hybrid Monte 
Carlo sampling of X. Given X, we can efficiently sam- 
ple Z using collapsed Gibbs sampling, integrating out 
the mixture parameters. Given Z, we can calculate the 
gradient of the unnormalized posterior distribution of 
X, integrating over warping functions. This gradient 
allows us to sample X using hybrid Monte Carlo. 

First, we explain collapsed Gibbs sampling for $Z$.
Given a sample of $X$, $p(Z \mid X, S, \nu, u, r, \eta)$ does not
depend on $Y$. This lets us resample cluster assignments
using the iGMM likelihood, with the mixture parameters
integrated out in closed form.
Given the current state of all but one latent assignment
$z_n$, a new value for $z_n$ is sampled from the following
probability:

$$p(z_n = c \mid X, Z_{\setminus n}, S, \nu, u, r, \eta) \propto \begin{cases} N_{c \setminus n} \cdot p(x_n \mid X_{c \setminus n}, S, \nu, u, r) & \text{existing components} \\ \eta \cdot p(x_n \mid S, \nu, u, r) & \text{a new component} \end{cases} \tag{10}$$



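where $X_c = \{x_n \mid z_n = c\}$ is the set of latent coordinates
assigned to the $c$th component, and $\setminus n$ represents
the value or set when excluding the $n$th data point. We
can analytically calculate $p(x_n \mid X_{c \setminus n}, S, \nu, u, r)$ as
follows:

$$p(x_n \mid X_{c \setminus n}, S, \nu, u, r) = \pi^{-\frac{Q}{2}} \frac{r_{c \setminus n}^{Q/2} |S_{c \setminus n}|^{\nu_{c \setminus n}/2}}{{r'}_{c \setminus n}^{Q/2} |S'_{c \setminus n}|^{\nu'_{c \setminus n}/2}} \prod_{q=1}^{Q} \frac{\Gamma\left(\frac{\nu'_{c \setminus n} + 1 - q}{2}\right)}{\Gamma\left(\frac{\nu_{c \setminus n} + 1 - q}{2}\right)}, \tag{11}$$

where $r'_{c \setminus n}$, $\nu'_{c \setminus n}$, $u'_{c \setminus n}$ and $S'_{c \setminus n}$ represent the posterior
Gaussian-Wishart parameters of the $c$th component
when the $n$th data point is assigned to the $c$th
component. We can efficiently calculate the determinant
by using the rank one Cholesky update. In the same
way, we can analytically calculate the likelihood for a
new component $p(x_n \mid S, \nu, u, r)$.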
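The collapsed Gibbs update can be sketched as follows. This is a minimal illustration, assuming the posterior parameters in (7) are recomputed from scratch rather than with the rank one Cholesky update mentioned above; the function names are hypothetical. The predictive density (11) is obtained here as the ratio of two marginal likelihoods of the form (6), one with and one without the point $x_n$.

```python
import math
import numpy as np

def log_marginal(Xc, S, nu, u, r):
    """One factor of eq. (6): log p(X_c | S, nu, u, r), with mu_c and R_c integrated out."""
    Xc = np.asarray(Xc, dtype=float).reshape(-1, len(u))
    N_c, Q = Xc.shape
    r_c, nu_c = r + N_c, nu + N_c                                   # eq. (7)
    u_c = (r * u + Xc.sum(axis=0)) / r_c
    S_c = S + Xc.T @ Xc + r * np.outer(u, u) - r_c * np.outer(u_c, u_c)
    log_p = -0.5 * N_c * Q * math.log(math.pi)
    log_p += 0.5 * Q * (math.log(r) - math.log(r_c))
    log_p += 0.5 * nu * np.linalg.slogdet(S)[1] - 0.5 * nu_c * np.linalg.slogdet(S_c)[1]
    for q in range(1, Q + 1):
        log_p += math.lgamma(0.5 * (nu_c + 1 - q)) - math.lgamma(0.5 * (nu + 1 - q))
    return log_p

def log_predictive(x_n, Xc_without_n, S, nu, u, r):
    """Eq. (11) as a ratio of marginals: log p(x_n | X_c without n, S, nu, u, r)."""
    Xc_without_n = np.asarray(Xc_without_n, dtype=float).reshape(-1, len(u))
    with_n = np.vstack([Xc_without_n, x_n])
    return log_marginal(with_n, S, nu, u, r) - log_marginal(Xc_without_n, S, nu, u, r)

def gibbs_assignment_probs(x_n, clusters_without_n, eta, S, nu, u, r):
    """Eq. (10): normalized probabilities for each existing component, then a new one."""
    log_w = [math.log(len(Xc)) + log_predictive(x_n, Xc, S, nu, u, r)
             for Xc in clusters_without_n]
    empty = np.empty((0, len(u)))
    log_w.append(math.log(eta) + log_predictive(x_n, empty, S, nu, u, r))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())          # subtract the max for numerical stability
    return w / w.sum()
```

Here `clusters_without_n` is assumed to be a list of arrays, one per non-empty component, holding the latent coordinates currently assigned to that component with $x_n$ removed. The last entry of the returned vector is the probability of opening a new component.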

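Hybrid Monte Carlo (HMC) sampling of $X$ from the
posterior $p(X \mid Z, Y, \theta, S, \nu, u, r)$ requires computing the
gradient of the log of the unnormalized posterior,
$\log p(Y \mid X, \theta) + \log p(X \mid Z, S, \nu, u, r)$. The first term
of the gradient can be calculated by

$$\frac{\partial \log p(Y \mid X, \theta)}{\partial K} = -\frac{D}{2} K^{-1} + \frac{1}{2} K^{-1} Y Y^\top K^{-1}, \tag{12}$$

and

$$\frac{\partial k(x_n, x_m)}{\partial x_n} = -\frac{\alpha}{\ell^2} \exp\left(-\frac{1}{2\ell^2}(x_n - x_m)^\top (x_n - x_m)\right)(x_n - x_m), \tag{13}$$

using the chain rule. The second term can be calculated
as follows:

$$\frac{\partial \log p(X \mid Z, S, \nu, u, r)}{\partial x_n} = -\nu_{z_n} S_{z_n}^{-1} (x_n - u_{z_n}). \tag{14}$$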



We also infer kernel hyperparameters $\theta = \{\alpha, \beta, \ell\}$
via HMC, using the gradient of the log unnormalized
posterior with respect to the kernel hyperparameters.
The complexity of each iteration of HMC is dominated
by the $O(N^3)$ computation of $K^{-1}$.¹
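To make the chain rule through (12) and (13) concrete, the sketch below computes the gradient of $\log p(Y \mid X, \theta)$ with respect to every latent coordinate. The function names are hypothetical, and an explicit matrix inverse is used for readability rather than speed. Each $x_n$ affects row $n$ and column $n$ of the symmetric matrix $K$, which is where the factor of two comes from.

```python
import numpy as np

def _kernels(X, alpha, beta, ell):
    """Noise-free RBF part of eq. (2), and the full K including the 1/beta diagonal."""
    sq = np.sum(X**2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K_rbf = alpha * np.exp(-0.5 * sq_dists / ell**2)
    return K_rbf, K_rbf + np.eye(len(X)) / beta

def log_lik(Y, X, alpha, beta, ell):
    """Eq. (1) in log form."""
    N, D = Y.shape
    _, K = _kernels(X, alpha, beta, ell)
    _, log_det = np.linalg.slogdet(K)
    return (-0.5 * D * N * np.log(2.0 * np.pi) - 0.5 * D * log_det
            - 0.5 * np.trace(Y.T @ np.linalg.solve(K, Y)))

def grad_log_lik_wrt_X(Y, X, alpha, beta, ell):
    """Gradient of log p(Y | X, theta) with respect to X, shape (N, Q)."""
    N, D = Y.shape
    K_rbf, K = _kernels(X, alpha, beta, ell)
    K_inv = np.linalg.inv(K)
    G = -0.5 * D * K_inv + 0.5 * K_inv @ Y @ Y.T @ K_inv   # eq. (12)
    # Eq. (13): dk(x_n, x_m)/dx_n = -(K_rbf[n, m] / ell^2) (x_n - x_m).
    W = -2.0 * G * K_rbf / ell**2
    # sum_m W[n, m] (x_n - x_m), written without an explicit loop.
    return W.sum(axis=1)[:, None] * X - W @ X
```

A gradient like this can be checked against central finite differences of `log_lik` before it is used inside an HMC sampler.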

In summary, we obtain samples from the posterior
$p(X, Z \mid Y, \theta, S, \nu, u, r, \eta)$ by iterating the following
procedures:

1. For each observation $n = 1, \dots, N$, sample the
component assignment $z_n$ by collapsed Gibbs
sampling (10).

2. Sample latent coordinates $X$ and kernel parameters
$\theta$ using hybrid Monte Carlo.



¹ This complexity could be improved by making use of
an inducing point approximation such as [6, 7].



4.1 Posterior Predictive Density 

In the GPLVM, the predictive density at a test point
$y_*$ is usually computed by finding the point $x_*$
which is most likely to be mapped to $y_*$, then using
the density of $p(x_*)$ and the Jacobian of the warping
at that point to approximately compute the density at
$y_*$. When inference is done by simply optimizing the
location of the latent points, this estimation method
requires solving a single optimization for each $y_*$.

For our model, we use approximate integration to
estimate $p(y_*)$. This is done for two reasons: First,
multiple latent points (possibly from different clusters) can
map to the same observed point, meaning the standard
method can underestimate $p(y_*)$. Second, because we
do not optimize the latent coordinates but rather
sample them, we would need to perform optimizations for
each $p(y_*)$ separately for each sample. Our method
gives estimates for all $p(y_*)$ at once, but may not be
accurate in very high dimensions.

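The posterior density in the observed space given the
training data is simply:

$$p(y_* \mid Y) = \iint p(y_*, x_*, X \mid Y)\, dx_*\, dX = \iint p(y_* \mid x_*, X, Y)\, p(x_* \mid X, Y)\, p(X \mid Y)\, dx_*\, dX. \tag{15}$$

We approximate $p(X \mid Y)$ using the samples from the
Gibbs and hybrid Monte Carlo samplers. We
approximate $p(x_* \mid X, Y)$ by sampling points from the latent
mixture and warping them, using the following
procedure:

1. Draw latent assignment
$z_* \sim \mathrm{Mult}\left(\frac{N_1}{N + \eta}, \dots, \frac{N_C}{N + \eta}, \frac{\eta}{N + \eta}\right)$

2. Draw precision matrix
$R_* \sim \mathcal{W}(S_{z_*}^{-1}, \nu_{z_*})$

3. Draw mean
$\mu_* \sim \mathcal{N}(u_{z_*}, (r_{z_*} R_*)^{-1})$

4. Draw latent coordinates
$x_* \sim \mathcal{N}(\mu_*, R_*^{-1})$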

When a new component $C + 1$ is assigned to $z_*$, the
prior Gaussian-Wishart distribution is used for
sampling in steps 2 and 3. The first factor of (15) can be
calculated by

$$p(y_* \mid x_*, X, Y) = \mathcal{N}\left(k_*^\top K^{-1} Y,\; k(x_*, x_*) - k_*^\top K^{-1} k_*\right), \tag{16}$$

where $k_* = (k(x_*, x_1), \dots, k(x_*, x_N))^\top$. Each step of
this procedure is exact, and since the observations $y_*$
are conditionally normally distributed, each one adds 
a smooth contribution to the empirical Monte Carlo 
estimate of the posterior density, as opposed to a col- 
lection of point masses. This procedure was used to 
generate the plots of posterior density in figures 1, 4, 
and 6. 
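The Monte Carlo estimate described in this section can be sketched as below, for a single posterior sample of $X$. The function names are hypothetical. One assumption is made explicit in the code: $k(x_*, x_*)$ is taken to be $\alpha + \beta^{-1}$, that is, the noise term of (2) is included because $y_*$ is a noisy observation.

```python
import numpy as np

def rbf(A, B, alpha, ell):
    """Noise-free part of eq. (2) between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return alpha * np.exp(-0.5 * np.maximum(sq, 0.0) / ell**2)

def gp_predict(x_star, X, Y, alpha, beta, ell):
    """Mean vector and shared variance of eq. (16) for one latent point x_star."""
    K = rbf(X, X, alpha, ell) + np.eye(len(X)) / beta
    k_star = rbf(X, x_star[None, :], alpha, ell)[:, 0]
    mean = k_star @ np.linalg.solve(K, Y)
    # Assumption: k(x*, x*) = alpha + 1/beta, so the variance includes observation noise.
    var = alpha + 1.0 / beta - k_star @ np.linalg.solve(K, k_star)
    return mean, var

def log_density_given_latent(y_star, x_star, X, Y, alpha, beta, ell):
    """log p(y* | x*, X, Y): an isotropic Gaussian over the D observed dimensions."""
    mean, var = gp_predict(x_star, X, Y, alpha, beta, ell)
    D = len(y_star)
    return -0.5 * D * np.log(2.0 * np.pi * var) - 0.5 * np.sum((y_star - mean)**2) / var

def mc_log_density(y_star, x_stars, X, Y, alpha, beta, ell):
    """log of the average of eq. (16) over latent points x_stars drawn from the mixture."""
    logs = np.array([log_density_given_latent(y_star, x, X, Y, alpha, beta, ell)
                     for x in x_stars])
    return logs.max() + np.log(np.mean(np.exp(logs - logs.max())))
```

In a full estimate of (15), `mc_log_density` would additionally be averaged over the posterior samples of $X$, with `x_stars` drawn by steps 1 to 4 above for each sample.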

5 Related work 

The GPLVM is effective as a nonlinear latent vari- 
able model in a wide variety of applications [8, 9, 10]. 
The latent positions X in the GPLVM are typically 
obtained by maximum a posteriori estimation or vari- 
ational Bayesian inference [11], placing a single fixed 
spherical Gaussian prior on x. A prior which penalizes 
a high-dimensional latent space is introduced by [12], 
in which the latent variables and their intrinsic dimen- 
sionality are simultaneously optimized. The iWMM 
can also infer the intrinsic dimensionality of nonlinear 
manifolds: inferring the Gaussian covariance for each 
latent cluster allows the variance of irrelevant dimen- 
sions to become small. Because each latent cluster has 
a different set of parameters, the effective dimension 
of each cluster can vary, allowing manifolds of differ- 
ent dimension in the observed space. This ability is 
demonstrated in figure 4b. 

The iWMM can also be viewed as a generalization of
the mixture of probabilistic principal component
analyzers [13], or mixture of factor analyzers [14], where
the linear mapping of the mixtures is generalized to a
nonlinear mapping by Gaussian processes, and the number
of components is infinite.

There exist non-probabilistic clustering methods 
which can find clusters with complex shapes, such as 
spectral clustering [15] and nonlinear manifold clus- 
tering [16, 17]. Spectral clustering finds clusters by 
first forming a similarity graph, then finding a low- 
dimensional latent representation using the graph, and 
finally, clustering the latent coordinates via k-means. 
The performance of spectral clustering depends on pa- 
rameters which are usually set manually, such as the 
number of clusters, the number of neighbors, and the 
variance parameter used for constructing the similar- 
ity graph. In contrast, the iWMM infers such parame- 
ters automatically. One of the main advantages of the 
iWMM over these methods is that there is no need to 
construct a similarity graph. 

The kernel Gaussian mixture model [18] can also find 
non-Gaussian shaped clusters. This model estimates 
a GMM in the implicit high-dimensional feature space 
defined by the kernel mapping of the observed space. 
However, the kernel GMM uses a fixed nonlinear mapping
function, with no guarantee that the latent points
will be well-modeled by a GMM. In contrast, the
iWMM infers the mapping function such that the
latent coordinates will be well-modeled by a mixture of
Gaussians.

Figure 3: A sample from the 2-dimensional latent
space when modeling a series of 32x32 face images.
Our model correctly discovers that the data consists
of two separate manifolds, both approximately
one-dimensional, which share the same head-turning
structure.

6 Experimental results 
6.1 Clustering Faces 

We first examined our model's ability to model im- 
ages without pre-processing. We constructed a dataset 
consisting of 50 greyscale 32x32 pixel images of two 
individuals from the UMIST faces dataset [19]. Both 
series of images capture a person turning his head to 
the right. Figure 3 shows a sample from the posterior 
over the latent coordinates and density model. The 
model has recovered three relevant, interpretable fea- 
tures of the dataset. First, that there are two dis- 
tinct faces. Second, that each set of images lies ap- 
proximately along a smooth one-dimensional manifold. 
Third, that the two manifolds share roughly the same 
structure: the front-facing images of both individuals 
lie close to one another, as do the side-facing images.



Observed space 




Latent space 



(a) 2-curve (b) 3-semi (c) 2-circle (d) Pinwheel 

Figure 4: Top row: The observed, unlabeled data points, and the clusters inferred by the iWMM. Bottom row: 
Latent coordinates and Gaussian components, shown for a single sample from the posterior. Each point in the 
latent space corresponds to a point in the observed space. This figure is best viewed in color. 



6.2 Synthetic Datasets 

Next, we demonstrate the proposed model on the four 
synthetic datasets shown in Figure 4. None of these 
four datasets can be appropriately clustered by Gaus- 
sian mixture models (GMM). For example, consider 
the 2-curve data shown in Figure 4 (a), where 100 
data points lie in one of two curved lines in a two- 
dimensional observed space. A GMM with two com- 
ponents cannot separate the two curved lines, while a 
GMM with many components could separate the two 
lines only by breaking each line into many clusters. 
In contrast, with the iWMM, the two non-Gaussian-shaped
clusters in the observed space were represented
by two Gaussian-shaped clusters in the latent space, as 
shown at the bottom row of Figure 4 (a). The iWMM 
separated the two curved lines by nonlinearly warping 
two Gaussians from the latent space to the observed 
space. 

Figure 4 (c) shows an interesting manifold learning 
challenge: a dataset consisting of two circles. The 
outer circle is modeled in the latent space by a Gaus- 
sian with effectively one degree of freedom. This linear 
topology fits the outer circle in the observed space by 
bending the two ends until they overlap. In contrast, 
the sampler fails to discover the 1D topology of the



inner circle, modeling it with a 2D manifold instead. 
This example demonstrates that each cluster in the 
iWMM manifold can have a different effective dimen- 
sion. 

6.3 Mixing 

An interesting side-effect of learning the number of
latent clusters is that this added flexibility can help the
sampler escape local minima, and so mix
properly. Figure 5 shows the samples of the latent
coordinates and clusters of the iWMM over time, when 
modeling the 2-curve data. 5(a) shows the latent coor- 
dinates initialized at the observed coordinates, start- 
ing with one latent component. At the 500th iteration 
5(b), each curved line is modeled by two components. 
At the 1800th iteration 5(c), the left curved line is 
modeled by a single component. At the 3000th iter- 
ation 5(d), the right curved line is also modeled by 
a single component, and the dataset is appropriately 
clustered. This configuration was relatively stable, and 
a similar state was found at the 5000th iteration. 

6.4 Density Estimation 

Figure 6 (a) shows the posterior density in the
observed space inferred by the iWMM on the 2-curve
data, computed using 1000 samples from the Markov
chain. The two separate manifolds of high density
implied by the two curved lines were recovered by the
iWMM. Note also that the density along the manifold
varies with the density of data shown in Figure 4 (a).
This result can be compared to a special case of our
model, which uses only a single Gaussian to model the
latent coordinates instead of an infinite GMM.
Figure 6 (b) shows the result of the iWMM with
C = 1, where the posterior of this single-cluster variant
is forced to place significant density connecting the two
clusters.

(a) 1 (b) 500 (c) 1800 (d) 3000

Figure 5: The inferred infinite GMMs over iterations in the two-dimensional latent space with the iWMM using
the 2-curve data. Labels indicate the number of iterations of the sampler, and the color of each point represents
its ordering in the observed coordinates.

(a) iWMM (b) iWMM (C = 1)

Figure 6: The posterior density in the observed space
with the 2-curve data inferred by the iWMM (a), and
that inferred by the iWMM with one component (b).

6.5 Visualization 

Next, we briefly investigate the potential of the iWMM 
for visualization. Figure 7 (a) shows the latent coor- 
dinates obtained by averaging over 1000 samples from 
the posterior of the iWMM. Because rotating the la- 
tent coordinates does not change their probability, av- 
eraging may not be an adequate way to summarize 
the posterior. However, we show this result in order to 



show the characteristics of latent coordinates obtained 
by the iWMM. The estimated latent coordinates are 
clearly separated, and they form two straight lines. 
This result indicates that in some cases, the iWMM 
can recover the topology of the data before it has been 
warped into a manifold. For comparison, Figure 7 (b) 
shows the latent coordinates estimated by the iWMM 
when forced to use a single cluster: the latent coor- 
dinates lie in two sections of a single straight line. 
Figure 7 (c) and (d) show the latent coordinates esti- 
mated by the GPLVM when optimizing or integrating 
out the latent coordinates, respectively. Recall that 
the iWMM (C = 1) is a more flexible model than the 
GPLVM, since the GPLVM enforces a spherical covari- 
ance in the latent space. These methods did not unfold 
the two curved lines, since the effective dimension of 
their latent representation is fixed beforehand. In con- 
trast, the iWMM effectively formed a low-dimensional 
representation in the latent space. 

Regardless of the dimension of the latent space, the 
iWMM will tend to model each cluster with as low- 
dimensional a Gaussian as possible. This is because, 
if the data in a cluster can be made to lie in a low- 
dimensional plane, a narrowly-shaped Gaussian will 
assign the latent coordinates much higher likelihood 
than a spherical Gaussian. 

6.6 Clustering Performance 

We more formally evaluated the density estimation 
and clustering performance of the proposed model us- 
ing four real datasets: iris, glass, wine and vowel, ob- 
tained from LIBSVM multi-class datasets [20], in ad- 
dition to the four synthetic datasets shown above: 2- 
curve, 3-semi, 2-circle and Pinwheel [21]. The statis- 
tics of these datasets are summarized in Table 1. In 
each experiment, we show the results of ten-fold
cross-validation. Results in bold are not significantly
different from the best performing method in each column
according to a paired t-test.

(a) iWMM (b) iWMM (C = 1) (c) GPLVM (d) BGPLVM

Figure 7: The estimated latent coordinates of the 2-curve data by (a) iWMM, (b) iWMM (C = 1), (c) GPLVM,
and (d) Bayesian GPLVM.

Table 1: The statistics of datasets used for evaluation.

                            2-curve  3-semi  2-circle  Pinwheel  Iris  Glass  Wine  Vowel
number of samples: N            100     300       100       250   150    214   178    528
observed dimensionality: D        2       2         2         2     4      9    13     10
number of clusters: C             2       3         2         5     3      7     3     11

Table 2: Average Rand index for evaluating clustering performance.

             2-curve  3-semi  2-circle  Pinwheel  Iris  Glass  Wine  Vowel
iGMM            0.52    0.79      0.83      0.81  0.78   0.60  0.72   0.76
iWMM(Q=2)       0.86    0.99      0.89      0.94  0.81   0.65  0.65   0.50
iWMM(Q=D)       0.86    0.99      0.89      0.94  0.77   0.62  0.77   0.76

Table 2 compares the clustering performance of the 
iWMM with the iGMM, quantified by the Rand in- 
dex [22], which measures the correspondence between 
inferred clusters and true clusters. The iGMM is an- 
other probabilistic generative model commonly used 
for clustering, which can be seen as a special case of the 
iWMM in which the Gaussian clusters are not warped. 
These experiments demonstrate the extent to which 
nonparametric cluster shapes allow a mixture model 
to recover more meaningful clusters. 
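For reference, the Rand index [22] is the fraction of point pairs on which two clusterings agree, either by placing the pair in the same cluster in both or in different clusters in both. The sketch below uses a hypothetical function name and a direct O(N^2) loop over pairs, which is adequate for datasets of the size listed in Table 1.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of pairs (i, j) treated consistently by the two labelings."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The index is invariant to how clusters are named, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` equals 1.0, which is the property needed when comparing inferred clusters with true classes.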

Table 3 lists average test log likelihood, compar- 
ing the proposed models with kernel density estima- 
tion (KDE), and the infinite Gaussian mixture model 
(iGMM). In KDE, the kernel width is estimated by 
maximizing the leave-one-out log densities. Since the 
manifold on which the observed data lies can be at 
most D-dimensional, we set the latent dimension Q 
equal to the observed dimension D in iWMMs. We 
also include the Q = 2 case in an attempt to characterize
how much modeling power is lost by forcing
the latent representation to be visualizable. The pro- 
posed models achieved high test log likelihoods com- 
pared with the KDE and iGMM. 



6.7 Source code 

Code to reproduce all the above experiments
is available at http://github.com/duvenaud/warped-mixtures.



7 Future work 

The Dirichlet process mixture of Gaussians in the la- 
tent space of our model could easily be replaced by 
a more sophisticated density model, such as a hier- 
archical Dirichlet process [23], or a Dirichlet diffusion 
tree [24]. Another straightforward extension of our 
model would be making inference more scalable by us- 
ing sparse Gaussian processes [6, 7] or more advanced 
hybrid Monte Carlo methods [25]. An interesting but 
more complex extension of the iWMM would be a 
semi-supervised version of the model. The iWMM 
could allow label propagation along regions of high 
density in the latent space, even if those regions were 
stretched along low-dimensional manifolds in the ob- 
served space. Another natural extension would be to 
allow a separate warping for each cluster, which would 
also improve inference speed. 



Table 3: Average test log likelihood for evaluating density estimation performance.

             2-curve  3-semi  2-circle  Pinwheel   Iris  Glass   Wine  Vowel
KDE            -2.47   -0.38     -1.92     -1.47  -1.87   1.26  -2.73   6.06
iGMM           -3.28   -2.26     -2.21     -2.12  -1.91   3.00  -1.87  -0.67
iWMM(Q=2)      -0.90   -0.18     -1.02     -0.79  -1.88   5.76  -1.96   5.91
iWMM(Q=D)      -0.90   -0.18     -1.02     -0.79  -1.71   5.70  -3.14  -0.35



8 Conclusion 

In this paper, we introduced a simple generative model 
of non-Gaussian density manifolds which can infer
nonlinearly separable clusters, low-dimensional repre- 
sentations of varying dimension per cluster, and den- 
sity estimates which smoothly follow data contours. 
We then introduced an efficient sampler for this model 
which integrates out both the cluster parameters and 
the warping function exactly. We further demon- 
strated that allowing non-parametric cluster shapes 
improves clustering performance over the Dirichlet 
process Mixture of Gaussians. 

Many methods have been proposed which can per- 
form some combination of clustering, manifold learn- 
ing, density estimation and visualization. We demon- 
strated that a simple but flexible probabilistic genera- 
tive model can perform well at all these tasks. 

Acknowledgements 

The authors would like to thank Dominique Perrault- 
Joncas, Carl Edward Rasmussen, and Ryan Prescott 
Adams for helpful discussions. 

References 

[1] C.E. Rasmussen. The infinite Gaussian mixture 
model. Advances in Neural Information Process- 
ing Systems, 12(5. 2):2, 2000. 

[2] C.E. Rasmussen and C.K.I. Williams. Gaussian
Processes for Machine Learning. The MIT Press, 
Cambridge, MA, USA, 2006. 

[3] H. Nickisch and C. Rasmussen. Gaussian mix- 
ture modeling with Gaussian process latent vari- 
able models. Pattern Recognition, pages 272-282, 
2010. 

[4] S.N. MacEachern and P. Müller. Estimating mixture
of Dirichlet process models. Journal of Computational
and Graphical Statistics, pages 223-238, 1998.

[5] Jayaram Sethuraman. A constructive definition 
of Dirichlet priors. Statistica Sinica, 4:639-650, 
1994. 

[6] J. Quiñonero-Candela and C.E. Rasmussen. A
unifying view of sparse approximate Gaussian
process regression. The Journal of Machine
Learning Research, 6:1939-1959, 2005.

[7] E. Snelson and Z. Ghahramani. Sparse Gaussian 
processes using pseudo-inputs. Advances in Neu- 
ral Information Processing Systems, 2006. 

[8] N.D. Lawrence. Gaussian process latent variable 
models for visualisation of high dimensional data. 
Advances in Neural Information Processing Sys- 
tems, 16:329-336, 2004. 

[9] M. Salzmann, R. Urtasun, and P. Fua. Local de- 
formation models for monocular 3D shape recov- 
ery. In IEEE Conference on Computer Vision and 
Pattern Recognition, CVPR, pages 1-8, 2008. 

[10] N.D. Lawrence and R. Urtasun. Non-linear ma- 
trix factorization with Gaussian processes. In 
Proceedings of the 26th Annual International 
Conference on Machine Learning, pages 601-608. 
ACM, 2009. 

[11] M. Titsias and N. Lawrence. Bayesian Gaussian 
process latent variable model. AISTATS, 2010. 

[12] A. Geiger, R. Urtasun, and T. Darrell. Rank 
priors for continuous non-linear dimensionality 
reduction. In IEEE Conference on Computer 
Vision and Pattern Recognition (CVPR), pages 
880-887. IEEE, 2009. 

[13] M.E. Tipping and C.M. Bishop. Mixtures of
probabilistic principal component analyzers. Neural
Computation, 11(2):443-482, 1999.

[14] Z. Ghahramani and M.J. Beal. Variational in- 
ference for Bayesian mixtures of factor analysers. 
Advances in Neural Information Processing Sys- 
tems, 12:449-455, 2000. 

[15] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral 
clustering: Analysis and an algorithm. Advances 
in Neural Information Processing Systems, 2:849- 
856, 2002. 

[16] W. Cao and R. Haralick. Nonlinear manifold clus- 
tering by dimensionality. In International Confer- 
ence on Pattern Recognition (ICPR), volume 1, 
pages 920-924. IEEE, 2006. 

[17] Ehsan Elhamifar and Rene Vidal. Sparse man- 
ifold clustering and embedding. In Advances 
in Neural Information Processing Systems, pages 
55-63, 2011. 



[18] J. Wang, J. Lee, and C. Zhang. Kernel trick em- 
bedded Gaussian mixture model. In Algorithmic 
Learning Theory, pages 159-174. Springer, 2003. 

[19] Daniel B Graham and Nigel M Allinson. Char- 
acterizing virtual eigensignatures for general pur- 
pose face recognition. Face Recognition: From 
Theory to Applications, 163:446-456, 1998. 

[20] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A
library for support vector machines. ACM Trans.
Intell. Syst. Technol., 2(3):27:1-27:27, 2011.

[21] R.P. Adams and Z. Ghahramani. Archipelago: 
nonparametric Bayesian semi-supervised learn- 
ing. In Proceedings of the 26th Annual Inter- 
national Conference on Machine Learning. ACM, 
2009. 

[22] W.M. Rand. Objective criteria for the evaluation 
of clustering methods. Journal of the American 
Statistical association, pages 846-850, 1971. 

[23] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei.
Hierarchical Dirichlet processes. Journal of the
American Statistical Association, 101(476):1566-1581,
2006.

[24] R.M. Neal. Density modeling and clustering us- 
ing dirichlet diffusion trees. Bayesian Statistics, 
7:619-629, 2003. 

[25] Y. Zhang and C. Sutton. Quasi-Newton Markov
chain Monte Carlo. Advances in Neural Information
Processing Systems, pages 2393-2401, 2011.



